{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Implementing Our First Bayesian Classifier](07.01-Implementing-Our-First-Bayesian-Classifier.ipynb) | [Contents](../README.md) | [Discovering Hidden Structures with Unsupervised Learning](08.00-Discovering-Hidden-Structures-with-Unsupervised-Learning.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifying Emails Using the Naive Bayes Classifier\n",
    "\n",
    "The final task of this chapter will be to apply our newly gained skills to a real spam filter!\n",
    "Naive Bayes classifiers are actually a very popular model for email filtering. Their naivety\n",
    "lends itself nicely to the analysis of text data, where each feature is a word (or a **bag of\n",
    "words**), and it would not be feasible to model the dependence of every word on every other\n",
    "word.\n",
    "\n",
    "A bunch of interesting email datasets are mentioned in the book.\n",
    "\n",
    "In this section, we will be using the Enron-Spam dataset, which can be downloaded for free\n",
    "from the website given in the book. However, if you followed the installation instructions at the\n",
    "beginning of this book and have downloaded the latest code from GitHub, you are already\n",
    "good to go!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the dataset\n",
    "\n",
    "If you downloaded the latest code from GitHub, you will find a number of `.tar.gz` files in the\n",
    "`notebooks/data/chapter7` directory. These files contain raw email data (with fields for\n",
    "`To:`, `Cc:`, and text body) that are either classified as spam (with the `SPAM = 1` class label) or\n",
    "not (also known as ham, the `HAM = 0` class label).\n",
    "\n",
    "We build a variable called `sources`, which contains all the raw data files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "HAM = 0\n",
    "SPAM = 1\n",
    "datadir = 'data/chapter7'\n",
    "sources = [\n",
    "    ('beck-s.tar.gz', HAM),\n",
    "    ('farmer-d.tar.gz', HAM),\n",
    "    ('kaminski-v.tar.gz', HAM),\n",
    "    ('kitchen-l.tar.gz', HAM),\n",
    "    ('lokay-m.tar.gz', HAM),\n",
    "    ('williams-w3.tar.gz', HAM),\n",
    "    ('BG.tar.gz', SPAM),\n",
    "    ('GP.tar.gz', SPAM),\n",
    "    ('SH.tar.gz', SPAM)\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to extract these files into subdirectories. For this, we can use the\n",
    "`extract_tar` function we wrote in the previous chapter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_tar(datafile, extractdir):\n",
    "    try:\n",
    "        import tarfile\n",
    "    except ImportError:\n",
    "        raise ImportError(\"You do not have tarfile installed. \"\n",
    "                          \"Try unzipping the file outside of Python.\")\n",
    "\n",
    "    tar = tarfile.open(datafile)\n",
    "    tar.extractall(path=extractdir)\n",
    "    tar.close()\n",
    "    print(\"%s successfully extracted to %s\" % (datafile, extractdir))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to apply the function to all data files in `sources`, we need to run a loop. The\n",
    "`extract_tar` function expects a path to the `.tar.gz` file (which we build from `datadir`\n",
    "and an entry in `sources`) and a directory to extract the files to (`datadir`). This will extract\n",
    "all emails in, for example, `data/chapter7/beck-s.tar.gz` to the\n",
    "`data/chapter7/beck-s/` subdirectory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data/chapter7/beck-s.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/farmer-d.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/kaminski-v.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/kitchen-l.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/lokay-m.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/williams-w3.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/BG.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/GP.tar.gz successfully extracted to data/chapter7\n",
      "data/chapter7/SH.tar.gz successfully extracted to data/chapter7\n"
     ]
    }
   ],
   "source": [
    "for source, _ in sources:\n",
    "    datafile = '%s/%s' % (datadir, source)\n",
    "    extract_tar(datafile, datadir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now here's the tricky bit. Every one of these subdirectories contains a number of other\n",
    "directories, wherein the text files reside. So we need to write two functions:\n",
    "- `read_single_file(filename)`: This is a function that extracts the relevant content from a single file called `filename`\n",
    "- `read_files(path)`: This is a function that extracts the relevant content from all files in a particular directory called `path`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "def read_single_file(filename):\n",
    "    # Skip the email header: the body starts after the first blank line\n",
    "    past_header, lines = False, []\n",
    "    if os.path.isfile(filename):\n",
    "        with open(filename, encoding=\"latin-1\") as f:\n",
    "            for line in f:\n",
    "                if past_header:\n",
    "                    lines.append(line)\n",
    "                elif line == '\\n':\n",
    "                    past_header = True\n",
    "    # Lines keep their trailing newline, so join without a separator\n",
    "    content = ''.join(lines)\n",
    "    return filename, content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def read_files(path):\n",
    "    for root, dirnames, filenames in os.walk(path):\n",
    "        for filename in filenames:\n",
    "            filepath = os.path.join(root, filename)\n",
    "            yield read_single_file(filepath)"
   ]
  },
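  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the header-skipping logic, here is a minimal sketch that applies\n",
    "the same idea to an in-memory string instead of a file. The sample message and the helper\n",
    "name `strip_header` are made up for illustration; they are not part of the dataset or of the\n",
    "functions defined above.\n",
    "\n",
    "```python\n",
    "import io\n",
    "\n",
    "def strip_header(stream):\n",
    "    # Keep only the lines after the first blank line, which is\n",
    "    # assumed to separate the header from the body\n",
    "    past_header, lines = False, []\n",
    "    for line in stream:\n",
    "        if past_header:\n",
    "            lines.append(line)\n",
    "        elif line == '\\n':\n",
    "            past_header = True\n",
    "    return ''.join(lines)\n",
    "\n",
    "# Hypothetical sample message (not taken from the dataset)\n",
    "sample = 'To: someone@example.com\\nSubject: hello\\n\\nFirst line\\nSecond line\\n'\n",
    "body = strip_header(io.StringIO(sample))\n",
    "print(repr(body))\n",
    "```\n",
    "\n",
    "Only the two body lines should remain; the `To:` and `Subject:` lines are dropped."
   ]
  },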
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a data matrix using Pandas\n",
    "\n",
    "Now it's time to introduce another essential data science tool that comes preinstalled with\n",
    "Python Anaconda: **Pandas**. Pandas is built on NumPy and provides a number of useful\n",
    "tools and methods to deal with data structures in Python. Just as we generally import\n",
    "NumPy under the alias `np`, it is common to import Pandas under the `pd` alias:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas provides a useful data structure called `DataFrame`, which can be understood as a\n",
    "generalization of a 2D NumPy array, as shown here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>class</th>\n",
       "      <th>model</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>cv2.ml.NormalBayesClassifier_create()</td>\n",
       "      <td>Normal Bayes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>sklearn.naive_bayes.MultinomialNB()</td>\n",
       "      <td>Multinomial Bayes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>sklearn.naive_bayes.BernoulliNB()</td>\n",
       "      <td>Bernoulli Bayes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                   class              model\n",
       "0  cv2.ml.NormalBayesClassifier_create()       Normal Bayes\n",
       "1    sklearn.naive_bayes.MultinomialNB()  Multinomial Bayes\n",
       "2      sklearn.naive_bayes.BernoulliNB()    Bernoulli Bayes"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame({\n",
    "    'model': ['Normal Bayes', 'Multinomial Bayes', 'Bernoulli Bayes'],\n",
    "    'class': [\n",
    "        'cv2.ml.NormalBayesClassifier_create()',\n",
    "        'sklearn.naive_bayes.MultinomialNB()',\n",
    "        'sklearn.naive_bayes.BernoulliNB()'\n",
    "    ]\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can combine the preceding functions to build a Pandas DataFrame from the extracted\n",
    "data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_data_frame(extractdir, classification):\n",
    "    rows = []\n",
    "    index = []\n",
    "    for file_name, text in read_files(extractdir):\n",
    "        rows.append({'text': text, 'class': classification})\n",
    "        index.append(file_name)\n",
    "\n",
    "    data_frame = pd.DataFrame(rows, index=index)\n",
    "    return data_frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then call it with the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect one DataFrame per source and concatenate them once at the end\n",
    "# (DataFrame.append was removed in pandas 2.0)\n",
    "frames = []\n",
    "for source, classification in sources:\n",
    "    extractdir = '%s/%s' % (datadir, source[:-7])  # strip the '.tar.gz' suffix\n",
    "    frames.append(build_data_frame(extractdir, classification))\n",
    "data = pd.concat(frames)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing the data\n",
    "\n",
    "Scikit-learn offers a number of options when it comes to encoding text features, which we\n",
    "discussed in [Chapter 4](04.00-Representing-Data-and-Engineering-Features.ipynb), *Representing Data and Engineering Features*. One of the simplest\n",
    "methods of encoding text data, we recall, is by **word count**: For each phrase, you count the\n",
    "number of occurrences of each word within it. In scikit-learn, this is easily done using\n",
    "`CountVectorizer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(52076, 643270)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import feature_extraction\n",
    "counts = feature_extraction.text.CountVectorizer()\n",
    "X = counts.fit_transform(data['text'].values)\n",
    "X.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a giant matrix, which tells us that we harvested a total of 52,076 emails that\n",
    "collectively contain 643,270 different words. However, scikit-learn is smart and saved the\n",
    "data in a sparse matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<52076x643270 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 8607632 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
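  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why the sparse representation matters, here is a rough back-of-the-envelope estimate\n",
    "based on the shape and the number of stored elements reported above. The 4 bytes per entry\n",
    "are an assumption (a dense `float32` array); the real footprint of the sparse matrix also\n",
    "includes index arrays, which this sketch ignores.\n",
    "\n",
    "```python\n",
    "n_samples, n_features = 52076, 643270   # shape of X\n",
    "n_stored = 8607632                       # entries actually stored in X\n",
    "\n",
    "n_total = n_samples * n_features\n",
    "density = n_stored / n_total\n",
    "dense_gb = n_total * 4 / 1e9             # assuming 4 bytes per float32 entry\n",
    "\n",
    "print('fraction of stored entries: %.6f' % density)\n",
    "print('dense float32 array: about %.0f GB' % dense_gb)\n",
    "```\n",
    "\n",
    "With these numbers, fewer than 0.03% of all entries are stored, whereas a dense copy of the\n",
    "same matrix would need roughly 134 GB."
   ]
  },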
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to build the vector of target labels (`y`), we need to access data in the Pandas\n",
    "DataFrame. This can be done by treating the DataFrame like a dictionary, where the `values`\n",
    "attribute will give us access to the underlying NumPy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = data['class'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training a normal Bayes classifier\n",
    "\n",
    "From here on out, things are (almost) like they always were. We can use scikit-learn to split\n",
    "the data into training and test sets. (Let's reserve 20% of all data points for testing):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import model_selection as ms\n",
    "X_train, X_test, y_train, y_test = ms.train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can instantiate a new normal Bayes classifier with OpenCV:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import cv2\n",
    "model_norm = cv2.ml.NormalBayesClassifier_create()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, OpenCV does not know about sparse matrices (at least its Python interface does\n",
    "not). If we were to pass `X_train` and `y_train` to the `train` function like we did earlier,\n",
    "OpenCV would complain that the data matrix is not a NumPy array. But converting the\n",
    "sparse matrix into a regular NumPy array will likely make you run out of memory.\n",
    "\n",
    "Thus, a possible workaround is to train the OpenCV classifier only on a subset of data points\n",
    "(say 1,000) and features (say 300):"
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "X_train_small = X_train[:1000, :300].toarray().astype(np.float32)\n",
    "y_train_small = y_train[:1000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then it becomes possible to train the OpenCV classifier (although this might take a while):\n",
    "\n",
    "> It appears that `NormalBayesClassifier` is broken in OpenCV 3.1 (segmentation fault). As a result, the kernel will die."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model_norm.train(X_train_small, cv2.ml.ROW_SAMPLE, y_train_small)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training on the full dataset\n",
    "\n",
    "However, if we want to classify the full dataset, we need a more sophisticated approach.\n",
    "We turn to scikit-learn's naive Bayes classifier, as it understands how to handle sparse\n",
    "matrices. In fact, if you didn't pay attention and treated `X_train` like every NumPy array\n",
    "before, you might not even notice that anything is different:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import model_selection as ms\n",
    "X_train, X_test, y_train, y_test = ms.train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we use `MultinomialNB` from the `naive_bayes` module, which is the version of the\n",
    "naive Bayes classifier that is best suited to handle discrete count data, such as word counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import naive_bayes\n",
    "model_naive = naive_bayes.MultinomialNB()\n",
    "model_naive.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classifier is trained almost instantly. We can then compute the scores for both the\n",
    "training and the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.95026404224675953"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_naive.score(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.948252688172043"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_naive.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And there we have it: 94.8% accuracy on the test set! Pretty good for not doing much other\n",
    "than using the default values, isn't it?\n",
    "\n",
    "However, what if we were super critical of our own work and wanted to improve the result\n",
    "even further? There are a couple of things we could do."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using n-grams to improve the result\n",
    "\n",
    "One thing to do is to use **$n$-gram counts** instead of plain word counts. So far, we have relied\n",
    "on what is known as a **bag of words**: We simply threw every word of an email into a bag\n",
    "and counted the number of its occurrences. However, in real emails, the **order in which\n",
    "words appear** can carry a great deal of information!\n",
    "\n",
    "This is exactly what $n$-gram counts are trying to convey. You can think of an $n$-gram as a\n",
    "phrase that is $n$ words long. For example, the phrase *Statistics has its moments* contains the\n",
    "following 1-grams: *Statistics*, *has*, *its*, and *moments*. It also has the following 2-grams:\n",
    "*Statistics has*, *has its*, and *its moments*. It also has two 3-grams (*Statistics has its* and *has its\n",
    "moments*), and only a single 4-gram.\n",
    "\n",
    "We can tell `CountVectorizer` to include $n$-grams of several orders in the feature matrix by\n",
    "specifying a range for $n$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "counts = feature_extraction.text.CountVectorizer(\n",
    "    ngram_range=(1, 2)\n",
    ")\n",
    "X = counts.fit_transform(data['text'].values)"
   ]
  },
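  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the $n$-gram example from above in code, the following minimal sketch fits a\n",
    "`CountVectorizer` with `ngram_range=(1, 2)` to that single phrase and lists the features it\n",
    "finds via the fitted `vocabulary_` attribute. Note that `CountVectorizer` lowercases the text\n",
    "by default. The variable names are chosen for illustration only.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "toy_counts = CountVectorizer(ngram_range=(1, 2))\n",
    "toy_counts.fit(['Statistics has its moments'])\n",
    "\n",
    "# vocabulary_ maps each 1-gram and 2-gram to its column index\n",
    "terms = sorted(toy_counts.vocabulary_)\n",
    "print(terms)\n",
    "```\n",
    "\n",
    "The list should contain the four 1-grams and the three 2-grams of the phrase."
   ]
  },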
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then repeat the entire procedure of splitting the data and training the classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import model_selection\n",
    "X_train, X_test, y_train, y_test = model_selection.train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_naive = naive_bayes.MultinomialNB()\n",
    "model_naive.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might have noticed that the training is taking much longer this time. To our delight, we\n",
    "find that the performance has significantly increased:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.97081413210445466"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_naive.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, $n$-gram counts are not perfect. They have the disadvantage of unfairly weighting\n",
    "longer documents (because there are more possible combinations of forming $n$-grams).\n",
    "\n",
    "To avoid this problem, we can use relative frequencies instead of a simple number of\n",
    "occurrences. We have already encountered one way to do so, and it had a horribly\n",
    "complicated name.\n",
    "\n",
    "Do you remember what it was called?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using tf-idf to improve the result\n",
    "\n",
    "It was called the **term frequency–inverse document frequency (tf–idf)**, and we encountered it in\n",
    "[Chapter 4](04.00-Representing-Data-and-Engineering-Features.ipynb), *Representing Data and Engineering Features*. If you recall, what tf–idf does is\n",
    "basically weigh the word counts by a measure of how often the words appear in the entire dataset.\n",
    "A useful side effect of this method is the idf part, the inverse frequency of words. This\n",
    "makes sure that frequent words, such as *and*, *the*, and *but*, carry only a small weight in the\n",
    "classification.\n",
    "\n",
    "We apply tf–idf to the feature matrix by creating a `TfidfTransformer` and calling its\n",
    "`fit_transform` method on our existing feature matrix `X`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tfidf = feature_extraction.text.TfidfTransformer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_new = tfidf.fit_transform(X)"
   ]
  },
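  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The down-weighting of frequent words can be seen on a toy corpus. In the following minimal\n",
    "sketch (the three short documents are made up for illustration), the word *the* appears in\n",
    "every document, so it should end up with a smaller tf–idf weight than a word that appears in\n",
    "only one document. The sketch uses the default settings of `TfidfTransformer`.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n",
    "\n",
    "docs = ['the offer', 'the meeting', 'the report']   # hypothetical corpus\n",
    "toy_counts = CountVectorizer()\n",
    "toy_X = toy_counts.fit_transform(docs)\n",
    "toy_tfidf = TfidfTransformer().fit_transform(toy_X)\n",
    "\n",
    "# Compare the weights of 'the' and 'offer' in the first document\n",
    "col_the = toy_counts.vocabulary_['the']\n",
    "col_offer = toy_counts.vocabulary_['offer']\n",
    "w_the = toy_tfidf[0, col_the]\n",
    "w_offer = toy_tfidf[0, col_offer]\n",
    "print(w_the, w_offer)\n",
    "```\n",
    "\n",
    "Both words occur exactly once in the first document, so any difference in weight comes from\n",
    "the idf part alone."
   ]
  },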
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Don't forget to split the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = ms.train_test_split(\n",
    "    X_new, y, test_size=0.2, random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, when we train and score the classifier again, we suddenly find a remarkable score of\n",
    "99% accuracy!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99039938556067586"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_naive = naive_bayes.MultinomialNB()\n",
    "model_naive.fit(X_train, y_train)\n",
    "model_naive.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To convince us of the classifier's awesomeness, we can inspect the **confusion matrix**. This is\n",
    "a matrix that shows, for every class, how many data samples were misclassified as\n",
    "belonging to a different class.\n",
    "\n",
    "The diagonal elements in the matrix tell us how many\n",
    "samples of the class $i$ were correctly classified as belonging to the class $i$. The off-diagonal\n",
    "elements represent misclassifications:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[3737,   93],\n",
       "       [   7, 6579]])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "metrics.confusion_matrix(y_test, model_naive.predict(X_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This tells us we got 3,737 class 0 classifications correct, and 6,579 class 1 classifications\n",
    "correct. We confused 93 samples of class 0 as belonging to class 1 and 7 samples of class 1\n",
    "as belonging to class 0."
   ]
  },
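  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick cross-check, the accuracy reported earlier can be recomputed from the confusion\n",
    "matrix: it is the sum of the diagonal divided by the total number of test samples. The minimal\n",
    "sketch below hard-codes the matrix printed above; the precision and recall for the spam class\n",
    "(class 1) are added for illustration and are not discussed in the original text.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Confusion matrix from the output above (rows: true class, columns: predicted class)\n",
    "cm = np.array([[3737, 93],\n",
    "               [7, 6579]])\n",
    "\n",
    "accuracy = np.trace(cm) / cm.sum()\n",
    "precision_spam = cm[1, 1] / cm[:, 1].sum()   # of all predicted spam, how much is spam\n",
    "recall_spam = cm[1, 1] / cm[1, :].sum()      # of all true spam, how much was found\n",
    "\n",
    "print('accuracy:  %.4f' % accuracy)\n",
    "print('precision: %.4f' % precision_spam)\n",
    "print('recall:    %.4f' % recall_spam)\n",
    "```\n",
    "\n",
    "The recomputed accuracy should agree with the value returned by `score` on the test set."
   ]
  },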
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Implementing Our First Bayesian Classifier](07.01-Implementing-Our-First-Bayesian-Classifier.ipynb) | [Contents](../README.md) | [Discovering Hidden Structures with Unsupervised Learning](08.00-Discovering-Hidden-Structures-with-Unsupervised-Learning.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
